Unordered N-gram Representation Based on Zero-suppressed BDDs for Text Mining and Classification

نویسندگان

  • Ryutaro Kurai
  • Shin-ichi Minato
  • Thomas Zeugmann
چکیده

In this paper, we present a new method to analyze unordered n-grams by using ZBDDs (Zero-suppressed BDDs). n-grams have been used not only for text analysis but also for text indexing in some search engines. We newly use a variation of n-grams called unordered n-grams. Unordered n-grams abstract from the position of the characters in each n-gram, i.e., they just deal with the range of ordinary n-grams or, synonymously, they reduce ordinary n-grams to sets of symbols. Furthermore, ZBDDs are known as a compact representation for combinatorial item sets. We newly apply ZBDD-based techniques to handle efficiently unordered n-grams. We conducted experiments for generating unordered n-gram statistical data for given real document files. The results obtained show the potentiality of our new methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

N-Gram Analysis Based on Zero-Suppressed BDDs

In present paper, we propose a new method of n-gram analysis using ZBDDs (Zero-suppressed BDDs). ZBDDs are known as a compact representation of combinatorial item sets. Here, we newly apply the ZBDD-based techniques for efficiently handling sets of sequences. Using the algebraic operations defined over ZBDDs, such as union, intersection, difference, etc., we can execute various processings and/...

متن کامل

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

Symmetric Item Set Mining Using Zero-suppressed BDDs

(Abstract) In this paper, we propose a method for discovering hidden information from large-scale item set data based on the symmetry of items. Symmetry is a fundamental concept in the theory of Boolean functions, and there have been developed fast symmetry checking methods based on BDDs (Binary Decision Diagrams). Here we discuss the property of symmetric items in data mining problems, and des...

متن کامل

A Theoretical Study on Variable Ordering of ZBDDs for Representing Frequent Itemsets

(Abstract) Recently, an efficient method of database analysis using Zero-suppressed Binary Decision Diagrams (ZBDDs) has been proposed. BDDs are a graph-based representation of Boolean functions, now widely used in system design and verification. Here we focus on ZBDDs, a special type of BDDs, which are suitable for handling large-scale combinatorial itemsets in frequent item-set mining. In gen...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007